Parallelizing a for loop

Introduction

Let us demonstrate parallelization with an example of a task that can be split across many threads.
Before that, let us look at the syntax of a kernel definition.

Syntax of kernel definition

// __global__ is a declaration specifier that marks copy_fun as a kernel: a function that runs on the device (GPU) and is launched from host code.
// A kernel cannot return a value, so its return type must be void.
// Arrays that the kernel reads or writes are passed as pointers to device memory, because the kernel cannot return its output to the CPU directly.
__global__ void copy_fun(int *data_out, int *data_in)
{
    // The simplest kernel just copies the 0th element of the array.
    data_out[0] = data_in[0];
    // Every launched thread executes this body, so here every thread copies the same element.
}

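int main()
{
    int N = 1000000000; // about 4 GB of data with 4-byte int; reduce N to try this on a smaller machine
    int threads_per_block = 256; // current GPUs allow at most 1024 threads per block, so N threads cannot fit in one block
    dim3 block_size(threads_per_block); // 256 threads per block in the x direction
    dim3 grid_size((N + threads_per_block - 1) / threads_per_block); // enough blocks in the x direction to cover all N elements

    int *h_data = new int[N](); // host (CPU) array, zero-initialized
    int *d_data = nullptr; // device (GPU) pointer; it must be allocated with cudaMalloc before a kernel can use it

    // Say we have an array of 1000000000 numbers and we want to increase each element by 1.
    // On the CPU this takes just the following for loop:
    for (int i = 0; i < N; i++)
    {
        h_data[i] += 1;
    }

    // Each iteration is independent of the others, so the loop above can be parallelized.

    delete[] h_data;
    return 0;
}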
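
The increment loop above maps naturally onto one GPU thread per element. The following is a minimal sketch of that idea, not a definitive implementation: the kernel name add_one, the helper add_one_on_gpu, the block size of 256 and the array size of 1 << 20 are all assumptions chosen for illustration. Each thread computes its global index from blockIdx, blockDim and threadIdx, and a bounds check skips the surplus threads in the last block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread increments exactly one element.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index of this thread
    if (i < n)                                     // threads past the end do nothing
        data[i] += 1;
}

// Hypothetical helper: copies h_data to the GPU, runs add_one, copies the result back.
// Returns true if every CUDA call succeeded.
bool add_one_on_gpu(std::vector<int> &h_data)
{
    if (h_data.empty()) return true;               // nothing to do, avoid a 0-block launch
    int n = static_cast<int>(h_data.size());
    size_t bytes = h_data.size() * sizeof(int);

    int *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    cudaError_t err = cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                         // threads per block (assumed value)
        int blocks = (n + threads - 1) / threads;  // round up so all n elements are covered
        add_one<<<blocks, threads>>>(d_data, n);
        err = cudaGetLastError();                  // catches an invalid launch configuration
    }
    if (err == cudaSuccess)                        // this copy also waits for the kernel to finish
        err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err == cudaSuccess;
}

int main()
{
    std::vector<int> h_data(1 << 20, 41);          // small array so it fits on any GPU
    if (!add_one_on_gpu(h_data)) {
        printf("a CUDA call failed\n");
        return 1;
    }
    printf("first = %d, last = %d\n", h_data.front(), h_data.back());
    return 0;
}
```

Compiled with nvcc, this should behave like the CPU loop, except that the iterations run concurrently. The array is kept small here because 1000000000 ints need about 4 GB of device memory, which not every GPU has.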


Let us look at some examples where a task can be parallelized.

Example 1.

Task: Modifying each element of the array.

CPU code

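#include <iostream>
#include <vector>
using std::vector;

// Multiply each of the N elements starting at p by 2.
void modify_vec(int *p, int N)
{
    for (int i = 0; i < N; i++)
    {
        p[i] *= 2;
    }
}

int main()
{
    int N = 1000000000; // about 4 GB of data with 4-byte int; reduce N to try this on a smaller machine
    vector<int> arr(N); // heap-allocated array of size N (an array this large does not fit on the stack)

    for (int i = 0; i < N; i++)
    {
        arr[i] = i;
    }

    modify_vec(arr.data(), N);

    // Print the first few results rather than all N of them.
    for (int i = 0; i < 10; i++)
    {
        std::cout << arr[i] << std::endl;
    }

    return 0;
}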
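
A GPU counterpart of modify_vec might look like the sketch below. It is an illustration under stated assumptions rather than the author's method: the names modify_vec_gpu and modify_vec_on_gpu, the fixed launch of 64 blocks of 256 threads, and the array size are all hypothetical. Instead of launching one thread per element, it uses a grid-stride loop, in which each thread starts at its global index and advances by the total number of threads, so a grid of fixed size can process an array of any length.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: grid-stride loop that doubles every element.
// Thread t handles elements t, t + stride, t + 2 * stride, ...
__global__ void modify_vec_gpu(int *p, int n)
{
    int stride = gridDim.x * blockDim.x;           // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        p[i] *= 2;
}

// Hypothetical helper: runs modify_vec_gpu on a host vector.
// Returns true if every CUDA call succeeded.
bool modify_vec_on_gpu(std::vector<int> &arr)
{
    if (arr.empty()) return true;
    int n = static_cast<int>(arr.size());
    size_t bytes = arr.size() * sizeof(int);

    int *d_arr = nullptr;
    if (cudaMalloc(&d_arr, bytes) != cudaSuccess) return false;

    cudaError_t err = cudaMemcpy(d_arr, arr.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        modify_vec_gpu<<<64, 256>>>(d_arr, n);     // fixed grid; the loop covers the rest
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)                        // this copy also waits for the kernel to finish
        err = cudaMemcpy(arr.data(), d_arr, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_arr);
    return err == cudaSuccess;
}

int main()
{
    const int N = 1 << 20;                         // smaller than the CPU example so it fits on any GPU
    std::vector<int> arr(N);
    for (int i = 0; i < N; i++)
        arr[i] = i;

    if (!modify_vec_on_gpu(arr)) {
        printf("a CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 5; i++)                    // print only the first few results
        printf("%d\n", arr[i]);
    return 0;
}
```

The trade-off of the grid-stride form is that each thread may do several iterations, but the launch configuration no longer has to be recomputed from N.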